ParaWeaver: Performance Evaluation on Programming Models for Fine Grained Threads
Authors
Abstract
There is a trend toward multicore and manycore processors in computer architecture design, and several parallel programming models have been introduced alongside them. Some extract concurrent threads implicitly wherever possible, producing fine-grained threads. Others construct threads from explicit user specifications in the program, producing coarse-grained threads. How these two mechanisms affect performance remains an open question. Implicitly constructed fine-grained threads incur more overhead from additional thread scheduling, thread communication, and thread context switches. However, they also increase scheduling flexibility, so computation resources can be utilized more fully and workloads are better balanced among cores. Moreover, if scheduled properly, concurrent fine-grained threads may exhibit more data affinity than coarse-grained threads. In most parallel architectures, the last-level cache is shared among all the cores and is therefore exposed to contention and pollution from concurrent threads. As a result, data sharing becomes important: a greater degree of data sharing among threads results in fewer last-level cache misses, one of the main sources of latency for a multithreaded process. The data-sharing behavior among the threads depends on how the application is parallelized and how the threads are scheduled. The complex nature of many applications leads to nested structures in the call graph, and concurrency can be found at levels ranging from coarse grained to fine grained. In this project, we compare the data-sharing behavior of coarse-grained threads and fine-grained threads, and evaluate their performance on a CMP cache simulator.
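The coarse-versus-fine distinction above can be sketched with a simple chunked reduction, where the chunk size controls granularity: one large chunk per worker is coarse grained, while many small chunks give the scheduler more freedom to balance load at the cost of extra task-management overhead. This is a minimal illustration only, not the project's actual workload; `partial_sum` and `parallel_sum` are hypothetical names, and Python's thread pool stands in for a generic task scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(data, lo, hi):
    # Each call is one schedulable task over a contiguous slice.
    return sum(data[lo:hi])

def parallel_sum(data, n_workers, chunk_size):
    # chunk_size ~ len(data)/n_workers -> coarse grained (one task per core).
    # chunk_size much smaller -> fine grained (many tasks; the pool's
    # scheduler balances them, at the cost of more scheduling overhead).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(partial_sum, data, lo, min(lo + chunk_size, len(data)))
            for lo in range(0, len(data), chunk_size)
        ]
        return sum(f.result() for f in futures)

data = list(range(1000))
coarse = parallel_sum(data, 4, 250)   # 4 coarse-grained tasks
fine = parallel_sum(data, 4, 10)      # 100 fine-grained tasks
assert coarse == fine == sum(data)
```

Both decompositions compute the same result; the performance question studied here is which granularity makes better use of the cores and of the shared last-level cache.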
Similar resources
A Performance Evaluation of Fine-Grain Thread Migration with Active Threads
Thread migration is established as a mechanism for achieving dynamic load sharing and data locality. However, migration has not been used with fine-grained parallelism due to the relatively high overheads associated with thread and messaging packages. This paper describes a high-performance thread migration system for fine-grained parallelism, implemented with user-level threads and user-level ...
Full text
Handling Massive Parallelism Efficiently: Introducing Batches of Threads
Emerging parallel architectures provide the means to efficiently handle larger numbers of more fine-grained parallel tasks. However, software for parallel programming still does not take advantage of these new possibilities, retaining the high cost associated with managing large numbers of threads. A significant percentage of this overhead can be attributed to operations on queues. In this ...
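The batching idea summarized in this abstract can be sketched roughly as follows, assuming a shared task queue where each dequeue returns a whole batch of tasks. This is a generic illustration of amortizing per-task queue operations across a batch, not the paper's actual mechanism; all names are hypothetical.

```python
import threading
from queue import Queue

def batch_worker(q, results):
    # One dequeue fetches a whole batch, so the cost of the shared
    # queue is paid once per batch instead of once per task.
    while True:
        batch = q.get()
        if batch is None:          # sentinel: shut this worker down
            return
        for task in batch:
            results.append(task())  # run every task in the batch locally

q = Queue()
results = []
workers = [threading.Thread(target=batch_worker, args=(q, results))
           for _ in range(2)]
for w in workers:
    w.start()

tasks = [(lambda i=i: i * i) for i in range(100)]
batch_size = 10
for lo in range(0, len(tasks), batch_size):
    q.put(tasks[lo:lo + batch_size])   # 10 enqueues instead of 100
for _ in workers:
    q.put(None)
for w in workers:
    w.join()

assert sorted(results) == [i * i for i in range(100)]
```

The trade-off mirrors the granularity discussion above: larger batches reduce queue traffic but coarsen the unit of load balancing.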
Full text
Extracting parallelism in OS kernels using type-safe languages
Operating system kernels are rife with potential concurrency, but exploiting this parallelism requires significant effort from programmers who write kernel code in C: the language provides little help for creating transient threads or packaging up arguments in closures, and fine-grained concurrency forces the programmer to reason carefully about what memory might be used by each thread and whe...
Full text
Hardware/Software Techniques for Assisted Execution Runtime Systems
The increasing complexity of modern and future multi-core/multithreaded processors raises the question of how best to utilize processor resources. On one side, Amdahl's Law limits the maximum theoretical speedup of parallel applications, while on the other, the increasing complexity of runtime systems and programming languages may introduce implicit serialization points. Several studies demonstrated tha...
Full text
Parallel Combining: Making Use of Free Cycles
There are two intertwined factors that affect the performance of concurrent data structures: the ability of processes to access the shared data in parallel and the cost of synchronization. It has been observed that for a class of "concurrency-averse" data structures, the use of fine-grained locking for parallelization does not pay off: an implementation based on a single global lock outperforms fine...
Full text
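The locking trade-off described in the last abstract can be illustrated with two counter variants: one guarded by a single global lock and one using per-stripe locks. This is a generic sketch of the two locking styles under discussion, not the paper's data structures or benchmark; both class names are hypothetical.

```python
import threading

class GlobalLockCounter:
    # Coarse grained: a single lock serializes every operation.
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def add(self, n=1):
        with self._lock:
            self.value += n

class StripedCounter:
    # Fine grained: one lock per stripe, so threads hashed to
    # different stripes can update in parallel, at the cost of more
    # lock objects and a combine step when reading the total.
    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._cells = [0] * stripes

    def add(self, n=1):
        i = threading.get_ident() % len(self._cells)
        with self._locks[i]:
            self._cells[i] += n

    @property
    def value(self):
        # Only exact once all writers have finished (joined).
        return sum(self._cells)

def hammer(counter, iters=10_000, threads=4):
    # Spawn several threads that each increment the counter.
    ts = [threading.Thread(target=lambda: [counter.add() for _ in range(iters)])
          for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter.value

assert hammer(GlobalLockCounter()) == 40_000
assert hammer(StripedCounter()) == 40_000
```

Which variant wins depends on contention and on synchronization cost, which is exactly the "concurrency-averse" observation: when operations are short, the fine-grained version's extra locking machinery can outweigh its added parallelism.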